WDN:AAR:th 7/31/2000 4239-55412 NIH (DHHS) Ref. No. E-209-99/0 Express Mail No. EL#121363839US 

Date of Deposit: July 31, 2000 

We Claim: 

1. A method of predicting a biological activity of a molecule, comprising:
obtaining spectral data for a test compound; 

comparing the spectral data of the test compound to a pattern of spectral data
associated with a biological activity, derived not exclusively from the assigned
spectral data of a training set of compounds having a known biological activity;

detecting similarities between the pattern of spectral data associated with a
biological activity of the training set and a pattern of spectral data for the test
compound to determine whether the test compound is predicted to share the biological
activity.

2. The method of claim 1 wherein the spectral data are obtained without
first correlating the spectral data with corresponding structural features. 

3. The method of claim 1 wherein the pattern of spectral data associated
with a biological activity is derived without first correlating the spectral data with 
corresponding structural features. 

4. The method of claim 1, wherein the pattern of spectral data of the
training set is a pattern obtained by separating the spectral data of the training set of
compounds into sub-spectral units.

5. The method of claim 4 wherein the pattern of spectral data of the test
compound is obtained by separating the spectral data of the test compound into
substantially the same sub-spectral units into which the spectral data of the training set
is separated.
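
As an illustration of claims 4 and 5, the following is a minimal sketch of separating spectra into sub-spectral units, assuming a spectrum is available as a list of (position, intensity) peaks. The function name `bin_spectrum`, the fixed bin width, and the peak values are hypothetical and are not taken from the claims; the point shown is only that the training and test spectra are separated with the same bin boundaries.

```python
def bin_spectrum(peaks, start, stop, width):
    """Sum peak intensities into fixed-width bins covering [start, stop)."""
    n_bins = int(round((stop - start) / width))
    bins = [0.0] * n_bins
    for position, intensity in peaks:
        if start <= position < stop:
            # Clamp guards against floating-point rounding at the upper edge.
            index = min(int((position - start) / width), n_bins - 1)
            bins[index] += intensity
    return bins

# Hypothetical peak lists: (chemical shift in ppm, intensity).
training_peaks = [(14.1, 1.0), (22.7, 2.0), (128.5, 4.0)]
test_peaks = [(14.3, 1.0), (130.2, 3.0)]

# Identical start, stop, and width give substantially the same sub-spectral units.
training_bins = bin_spectrum(training_peaks, 0.0, 200.0, 50.0)
test_bins = bin_spectrum(test_peaks, 0.0, 200.0, 50.0)
print(training_bins)  # [3.0, 0.0, 4.0, 0.0]
print(test_bins)      # [1.0, 0.0, 3.0, 0.0]
```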

6. The method of claim 1, wherein the spectral data is one type of
spectral data.

7. The method of claim 6, wherein the spectral data comprises one of
nuclear magnetic resonance, mass spectral, infrared, ultraviolet-visible, fluorescence,
or phosphorescence data.

8. The method of claim 1, wherein the spectral data is a composite of
different types of spectral data.

9. The method of claim 8, wherein the different types of spectral data
comprise two or more of the group consisting of nuclear magnetic spectroscopy
(NMR), mass spectroscopy (MS), infrared (IR) spectroscopy, and ultraviolet-visible
(UV-Vis) spectroscopy.

10. The method of claim 1, wherein the spectral pattern of the test
compound and the spectral pattern of the training set are segmented into sub-spectral
units, and the spectral data of the training set is scaled to normalize the importance of
different signals within the spectral data of the training set prior to deriving a pattern
associated with a biological activity.

11. The method of claim 10, wherein the scaling is auto-scaling.
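
The scaling of claims 10 and 11 can be sketched as follows, assuming auto-scaling in its usual chemometric sense: each sub-spectral unit (column) is mean-centered and divided by its standard deviation across the training set. The function name `autoscale` and the example values are hypothetical, and the use of the sample standard deviation is an assumption rather than something the claims specify.

```python
from statistics import mean, stdev

def autoscale(rows):
    """Mean-center each bin (column) and divide by its sample standard deviation."""
    columns = list(zip(*rows))
    means = [mean(column) for column in columns]
    stds = [stdev(column) for column in columns]
    scaled = [
        # A bin with no variation carries no information, so it is set to 0.0.
        [(x - m) / s if s > 0 else 0.0 for x, m, s in zip(row, means, stds)]
        for row in rows
    ]
    return scaled, means, stds

# Hypothetical training set: two compounds, two bins on very different scales.
training_bins = [[1.0, 10.0], [3.0, 30.0]]
scaled_bins, bin_means, bin_stds = autoscale(training_bins)
```

After scaling, both bins span the same range, so a strong signal no longer outweighs a weak one simply because of its magnitude.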

12. The method of claim 10, wherein the spectral data of the training set is
weighted to emphasize signals that are important for determining the endpoint class of
compounds in the training set before deriving a pattern associated with a biological
activity.

13. The method of claim 12 wherein the weighting is Fisher-weighting.
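
The weighting of claims 12 and 13 can be illustrated with one common two-class form of the Fisher weight, the squared difference of class means divided by the sum of class variances. The claims do not state which formula is intended, so this form, the function name `fisher_weights`, and the example values are assumptions.

```python
from statistics import mean, pvariance

def fisher_weights(class_a, class_b):
    """Per-bin Fisher weight: (mean_a - mean_b)^2 / (var_a + var_b)."""
    weights = []
    for col_a, col_b in zip(zip(*class_a), zip(*class_b)):
        spread = pvariance(col_a) + pvariance(col_b)
        gap = (mean(col_a) - mean(col_b)) ** 2
        # Bins with no within-class spread are given zero weight here (a design choice).
        weights.append(gap / spread if spread > 0 else 0.0)
    return weights

# Hypothetical binned spectra for two endpoint classes (rows are compounds).
active = [[1.0, 5.0], [3.0, 5.5]]
inactive = [[7.0, 5.0], [9.0, 5.5]]

weights = fisher_weights(active, inactive)
print(weights)  # [18.0, 0.0]
```

The first bin separates the two classes and receives a large weight; the second bin is identical in both classes and receives none.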

14. The method of claim 1, wherein detecting similarities between the
pattern of spectral data associated with a biological activity of the training set and the
pattern of spectral data for the test compound comprises performing computer
implemented pattern recognition.

15. The method of claim 1, wherein detecting similarities between the
pattern of spectral data associated with a biological activity of the training set and the
pattern of spectral data for the test compound comprises detecting relative intensities
of signals associated with one or more of the sub-units of the spectrum of the training
set, and detecting relative intensities of signals associated with the same one or more
sub-units of a spectrum of the test compound.

16. The method of claim 15, wherein the relative intensities are canonical
variate factors of the spectral data associated with a biological activity of the training
set and the spectral signals of the test compound.

17. The method of claim 1, wherein the method is computer implemented.



18. A computer implemented system for predicting a biological activity of
a test compound, comprising:

receiving as input spectral data for a test compound;

receiving as input spectral data of a training set of compounds having a known
biological activity; and

comparing the pattern of spectral data of the training set associated with the
biological activity to the spectral pattern of the test compound to determine whether
the spectral pattern of the test compound matches the spectral pattern associated with
the biological activity of the training set.

19. The computer implemented system of claim 18, wherein comparing the
spectral patterns comprises comparing the spectral patterns with computer
implemented pattern recognition programs.

20. The computer implemented system of claim 19, wherein the spectral
data for the test compound and the spectral data for the training set are divided into
substantially identical spectral bins, so that a signal within individual spectral bins is
compared between spectral patterns of the training set associated with the biological
activity and the test compound.

21. The computer implemented system of claim 18, wherein the spectral
patterns are obtained by inputting spectral data selected from the group consisting of
nuclear magnetic resonance data, mass spectral data, infrared data, ultraviolet-visible
data, fluorescence data, phosphorescence data, and composites of two or more such
spectral data.

22. The computer implemented system of claim 21, wherein the spectral
data for the training set are converted into canonical variates associated with the
biological activity of the training set, and the spectral data for the test compound are
compared to the canonical variates of the training set spectral data.

23. The computer implemented system of claim 22, wherein the biological
activity is binding affinity to a hormone receptor, and the canonical variates for the
training set include peaks in bins that are associated with hormone receptor binding of
a pre-selected affinity.

24. The computer implemented system of claim 23, wherein the spectral
data comprise nuclear magnetic resonance data and mass spectral data. 

25. A computer readable medium having stored thereon instructions for
performing the actions of claim 1.

26. A computer readable medium having stored thereon instructions for
performing the actions of claim 18. 

27. A method for predicting a biological, chemical, or physical property of
a molecule, comprising:

providing spectral data segmented into spectral sub-units, for a plurality of
training compounds;

inputting the segmented spectral data and endpoint data into a pattern-
recognition program;

training the pattern-recognition program with the segmented spectral data and
endpoint data to establish a relationship between the spectral sub-units of the
segmented spectral data and the endpoint;

providing segmented spectral data for a test compound that is segmented into
substantially the same spectral sub-units as the spectral data of the training
compounds; and

comparing the relationship between the spectral subunits of the segmented
spectral data and the endpoint to the spectral subunits of the test compound's
segmented spectral data to predict the endpoint of the test compound.
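
The train-then-predict sequence of claim 27 can be sketched with a nearest-centroid classifier standing in for the pattern-recognition program. The claim does not name a particular program, so this choice, the function names `train_centroids` and `predict_endpoint`, the endpoint labels, and the data are all hypothetical.

```python
from math import dist

def train_centroids(segmented_spectra, endpoints):
    """Relate sub-units to endpoints by averaging the spectra of each endpoint class."""
    grouped = {}
    for spectrum, label in zip(segmented_spectra, endpoints):
        grouped.setdefault(label, []).append(spectrum)
    return {
        label: [sum(column) / len(column) for column in zip(*rows)]
        for label, rows in grouped.items()
    }

def predict_endpoint(centroids, test_spectrum):
    """Return the endpoint class whose mean pattern lies closest to the test spectrum."""
    return min(centroids, key=lambda label: dist(centroids[label], test_spectrum))

# Hypothetical training compounds, segmented into the same three sub-units.
training_spectra = [
    [3.0, 0.0, 4.0],
    [2.5, 0.0, 5.0],
    [0.0, 6.0, 0.5],
    [0.0, 5.0, 1.0],
]
training_endpoints = ["binder", "binder", "non-binder", "non-binder"]

centroids = train_centroids(training_spectra, training_endpoints)
predicted = predict_endpoint(centroids, [2.0, 0.5, 4.0])
print(predicted)  # binder
```

No structural information about the compounds enters this sketch; only the segmented spectral data and the endpoint labels are used.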

28. The method of claim 27, wherein knowledge of the structures of the
training compounds and the test compound are not necessarily known beforehand.

29. The method of claim 27 wherein the segmented spectral data of the
training set is autoscaled and Fisher-weighted before training the pattern recognition
program.

30. The method of claim 27 wherein the spectral data is chosen from the
group consisting of nuclear magnetic resonance data, mass spectral data, infrared 
data, UV-Vis data, fluorescence data, phosphorescence data, and composites thereof. 

31. The method of claim 30 wherein the spectral data is chosen from the
group consisting of 13C NMR data, EI MS data, and composites thereof.

32. The method of claim 27, wherein the endpoint is a ligand-target
molecule binding affinity.

33. The method of claim 32, wherein the ligand-target molecule binding
affinity is an estrogen-receptor binding affinity.

34. The method of claim 27, wherein the endpoint is selected from the
group consisting of:

a measure of biodegradability;
a measure of toxicity;
participation in a metabolic pathway;
a partition coefficient;
a reaction rate;
a quantum yield;
a measure of phototoxicity;
an equilibrium constant; and
a site of reaction on a molecular structure.

35. The method of claim 34, wherein the partition coefficient is the
octanol-water partition coefficient.

36. The method of claim 27, wherein non-spectral structure descriptors that
do not necessarily depend upon structural knowledge beforehand are provided for the
test compound and the training compounds; used to establish a relationship between
the segmented spectral data, the non-spectral structure descriptors, and the endpoint
for the training compounds; and used to predict the endpoint of the test compound.

37. The method of claim 36, wherein the non-spectral structure descriptors
are chosen from the group consisting of partition coefficients, solubilities, relative
acidities, relative basicities, pKa, pKb, reaction rates, and equilibrium constants.

38. The method of claim 37, wherein the partition coefficient is the
octanol-water partition coefficient.

39. The method of claim 27, wherein non-spectral structure descriptors that
do depend upon structural knowledge beforehand are provided for the test compound
and the training compounds; used to establish a relationship between the segmented
spectral data, the non-spectral structure descriptors, and the endpoint for the training
compounds; and used to predict the endpoint of the test compound.

40. The method of claim 39, wherein the non-spectral descriptors are
calculated using a quantum mechanical or electrostatic potential method.

41. A method for using spectral data as a set of structure descriptors for a
compound that does not necessarily require knowledge of the compound's structure
beforehand, comprising:
providing spectral data;
segmenting the spectral data

42. A method for establishing a relationship between spectral data and a
biological, chemical, or physical property, comprising:
providing spectral data;
segmenting the spectral data into bins; and
detecting patterns in the bins of the spectral data that are associated with the
property.

43. The method of claim 42, further comprising detecting corresponding
patterns in spectral data of test compounds to select test compounds having the
property.

44. The method of claim 43, wherein the test compounds are mixtures of
compounds.

45. The method of claim 42 including one or more of: auto-scaling the
segmented data; and weighting the segmented spectral data. 

46. The method of claim 45 wherein weighting of the segmented data
comprises Fisher-weighting of the segmented spectral data.

47. The method of claim 1, wherein the biological activity of the test 
compound is predicted without reference to a chemical structure of the test 
compound.

48. A method for establishing a spectral data-activity relationship,
comprising:

providing endpoint data for a plurality of compounds;
providing spectral data for a plurality of compounds;
segmenting the spectral data of the plurality of compounds into bins;
autoscaling the numerical data obtained from the spectral features within each
of the bins;
Fisher-weighting the data within each of the bins; and
correlating information [...] endpoint using a means for pattern
recognition.

49. The method of claim 47 wherein the spectral data is selected from the
group consisting of NMR data, MS data, IR data, UV-Vis data, fluorescence data,
phosphorescence data, and composites thereof.

50. The method of claim 48 wherein two or more types of spectral data are 
normalized to each other in a composite. 
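
One way to read claim 50 is that each type of spectral data is brought to a common scale before the types are joined into a composite. The sketch below assumes normalization to unit total intensity per block; the claim does not specify the normalization, and the function names `normalize_block` and `composite_spectrum` and the values are hypothetical.

```python
def normalize_block(bins):
    """Scale one type of binned spectral data so its intensities sum to 1."""
    total = sum(bins)
    return [value / total for value in bins] if total > 0 else list(bins)

def composite_spectrum(*blocks):
    """Concatenate separately normalized blocks into one composite descriptor."""
    combined = []
    for block in blocks:
        combined.extend(normalize_block(block))
    return combined

# Hypothetical binned NMR and MS data recorded on very different intensity scales.
nmr_bins = [3.0, 0.0, 1.0]
ms_bins = [100.0, 300.0]

composite = composite_spectrum(nmr_bins, ms_bins)
print(composite)  # [0.75, 0.0, 0.25, 0.25, 0.75]
```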

51. A method for determining the structural features of a plurality of
compounds that contribute to determining a particular endpoint property exhibited by
the compounds, comprising:

providing segmented spectral data for the plurality of compounds;

providing endpoint data for the plurality of compounds;

establishing a spectral data-activity relationship by identifying the segmented
spectral features that bias toward increased endpoint values and the segmented
spectral features that bias toward decreased endpoint values; and

identifying the structural feature leading to the segmented spectral features that
bias toward increased or decreased endpoint values for the plurality of compounds.
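
The identification of biasing spectral features in claim 51 can be sketched by taking the sign of the covariance between each bin's intensity and the endpoint value. Using covariance as the measure of bias is an assumption, as are the function name `bin_bias`, the labels it returns, and the data; assigning a structural feature to each flagged bin remains a separate interpretive step.

```python
from statistics import mean

def bin_bias(segmented_spectra, endpoint_values):
    """Label each bin by whether its intensity rises or falls with the endpoint."""
    y_mean = mean(endpoint_values)
    biases = []
    for column in zip(*segmented_spectra):
        x_mean = mean(column)
        covariance = sum(
            (x - x_mean) * (y - y_mean) for x, y in zip(column, endpoint_values)
        )
        if covariance > 0:
            biases.append("increase")
        elif covariance < 0:
            biases.append("decrease")
        else:
            biases.append("neutral")
    return biases

# Hypothetical segmented spectra (rows are compounds) and endpoint values.
spectra = [[1.0, 4.0, 2.0], [2.0, 3.0, 2.0], [3.0, 1.0, 2.0]]
endpoints = [0.1, 0.5, 0.9]

biases = bin_bias(spectra, endpoints)
print(biases)  # ['increase', 'decrease', 'neutral']
```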

52. The method of claim 51 wherein the segmented spectral data is selected
from the group consisting of NMR data, MS data, IR data, UV-Vis data, fluorescence
data, phosphorescence data, and composites thereof.

53. The method of claim 52 wherein two or more types of segmented
spectral data are normalized to each other in a composite.

54. The method of claim 52 wherein the segmented spectral data is a
composite of 13C NMR and EI MS data.